FlatCam is a thin form-factor lensless camera that consists of a coded mask placed on top of a bare, conventional sensor array. 
Unlike a traditional, lens-based camera where an image of the scene is directly recorded on the sensor pixels, each pixel in FlatCam 
records a linear combination of light from multiple scene elements. A computational algorithm is then used to demultiplex the 
recorded measurements and reconstruct an image of the scene. FlatCam is an instance of a coded aperture imaging system; however, 
unlike the vast majority of related work, we place the coded mask extremely close to the image sensor, which enables a thin system. 
We employ a separable mask to ensure that both calibration and image reconstruction are scalable in terms of memory requirements 
and computational complexity. We demonstrate the potential of the FlatCam design using two prototypes: one at visible wavelengths 
and one at infrared wavelengths. 


I. Introduction 

A range of new imaging applications is driving the
miniaturization of cameras. As a consequence, significant progress 
has been made towards minimizing the total volume of the 
camera, which has enabled new applications in endoscopy, pill 
cameras, and in vivo microscopy. Unfortunately, this strategy 
of miniaturization has an important shortcoming: the amount 
of light collected at the sensor decreases dramatically as 
the lens aperture and the sensor size become smaller. As a 
consequence, ultra-miniature imagers built simply by scaling 
down the optics and sensors suffer from extremely low light 
collection. 

In this paper, we present a camera architecture that we 
call FlatCam, which is inspired by coded aperture imaging 
principles pioneered in astronomical X-ray and gamma-ray 
imaging. Our proposed FlatCam design uses a large 
photosensitive area with a very thin form factor. The FlatCam 
achieves its thin form factor by dispensing with a lens and 
replacing it with a coded, binary mask placed almost immediately 
atop a bare conventional sensor array. The image formed on 
the sensor can be viewed as a superposition of many pinhole 
images. Thus, the light collection ability of such a coded 
aperture system is proportional to the size of the sensor and 
the transparent regions (pinholes) in the mask. In contrast, the 
light collection ability of a miniature, lens-based camera is 
limited by the lens aperture size, which is restricted by the 
requirements on the device thickness. 

An illustration of the FlatCam design is presented in Fig. 1. 
Light from a scene passes through a coded mask and lands on 
a conventional image sensor. The mask consists of opaque and 
transparent features (to block and transmit light, respectively); 
each transparent feature can be viewed as a pinhole. Light 
from the scene gets diffracted by the mask features such that 
light from each scene location casts a unique mask shadow on 
the sensor, and this mapping can be represented using a linear 
operator. A computational algorithm then inverts this linear 
operator to recover the original light distribution of the scene 
from the sensor measurements. 

Our FlatCam design has many attractive properties besides 
its slim profile. First, since it reduces the thickness of the 
camera but not the area of the sensor, it collects more light than 
miniature, lens-based cameras of the same thickness. Second, 
the mask can be created from inexpensive materials that 
operate over a broad range of wavelengths. Third, the mask can 
be fabricated simultaneously with the sensor array, creating 
new manufacturing efficiencies. The mask can be fabricated 
either directly in one of the metal interconnect layers on top 
of the photosensitive layer or on a separate wafer that is 
thermal-compression bonded to the back side of the sensor, 
as is typical for back-side illuminated image sensors. 

We demonstrate the potential of the FlatCam using two 
prototypes built in our laboratory with commercially available 
sensors and masks: a visible prototype in which the mask-sensor 
spacing is about 0.5mm and a short-wave infrared 
(SWIR) prototype in which the spacing is about 5mm. Figures 
7 and 11 illustrate sensor measurements and reconstructed 
images using our prototype FlatCams. 

II. Related work 

Pinhole cameras. Imaging without a lens is not a new idea. 
Pinhole cameras, the progenitor of lens-based cameras, have 
been well known since Alhazen (965-1039AD) and Mozi 
(c. 370BCE). However, a tiny pinhole drastically reduces the 
amount of light reaching the sensor, resulting in noisy, low- 
quality images. Indeed, lenses were introduced into cameras 
for precisely the purpose of increasing the size of the aperture, 
and thus the light throughput, without degrading the sharpness 
of the acquired image. 

Coded apertures. Coded aperture cameras extend the idea 
of a pinhole camera by using masks with multiple pinholes. 
The primary goal of coded aperture cameras is to 
increase the light throughput compared to a pinhole camera. 
Figure 2 summarizes some salient features of pinhole, 
lens-based, and FlatCam (coded mask-based) architectures. 
Fig. 1: FlatCam architecture. (a) Every light source within the camera field-of-view contributes to every pixel in the multiplexed image 
formed on the sensor. A computational algorithm reconstructs the image of the scene. Inset shows the mask-sensor assembly of our prototype 
in which a binary, coded mask is placed 0.5mm away from an off-the-shelf digital image sensor. (b) An example of sensor measurements 
and the image reconstructed by solving a computational inverse problem. 


Coded-aperture cameras have traditionally been used for 
imaging wavelengths beyond the visible spectrum (e.g., X-ray 
and gamma-ray imaging), for which lenses or mirrors 
are expensive or infeasible. Mask-based 
lens-free designs have also been proposed for flexible 
field-of-view selection in compressive single-pixel imaging using a 
transmissive LCD panel, and separable coded masks. 

In recent years, coded masks and light modulators have also 
been added to lens-based cameras in different configurations 
to build novel imaging devices that can capture image and 
depth, dynamic video, or 4D light fields 
from a single coded image. Coded aperture-based systems 
using compressive sensing principles [15], [17] have also been 
studied for image super-resolution [18], spectral imaging, 
and video capture. 

Existing coded aperture-based lensless systems have two 
main limitations. First, the large body of work devoted to 
coded apertures invariably places the mask significantly far 
away from the sensor (e.g., 65mm distance in [10]). In 
contrast, our FlatCam design offers a thin form factor. For 
instance, in our prototype with a visible sensor, the spacing 
between the sensor and the mask is only 0.5mm. Second, the 
masks employed in some designs have transparent features 
only in a small central region whose area is invariably much 
smaller than the area of the sensor. In contrast, almost half 
of the features (spread across the entire surface) in our mask 
are transparent. As a consequence, the light throughput of our 
designs is many orders of magnitude larger than that of 
previous designs. Furthermore, some previously proposed 
lensless cameras use programmable spatial light modulators (SLM) 
and capture multiple images while changing the mask patterns. 
In contrast, we use a static mask in our design, which can 
potentially be fixed on the sensor during fabrication or the 
assembly process. 

Camera arrays. A number of thin imaging systems have 
been developed over the last few decades. The TOMBO 
architecture, inspired by insect compound eyes, reduces 
the camera thickness by replacing a single, large focal-length 
lens with multiple, small focal-length microlenses. Each 
microlens and the sensor area underneath it can be viewed as 
a separate low-resolution, lens-based camera, and a single 
high-resolution image can be computationally reconstructed 
by fusing all of the sensor measurements. Similar architectures 
have been used for designing thin infrared cameras. The 
camera thickness in this design is dictated by the geometry 
of the microlenses; reducing the camera thickness requires 
a proportional reduction in the sizes of the microlenses and 
sensor pixels. As a result, microlens-based cameras currently 
offer only up to a four-fold reduction in the camera thickness. 

Folded optics. An alternate approach for achieving thin 
form factors relies on folded optics, where light manipulation 
similar to that of a traditional lens is achieved using multi-fold 
reflective optics [25]. However, systems based on folded optics 
have low light collection efficiencies. 

Ultra-miniature lensless imaging with diffraction gratings. 
Recently, miniature cameras with integrated diffraction 
gratings and CMOS image sensors have been developed [26]. 
These cameras have been successfully demonstrated on 
tasks such as motion estimation and face detection. While 
these cameras are indeed ultra-miniature in total volume (100 
micron sensor width by 200 micron thickness), they retain 
the large thickness-to-width ratio of conventional lens-based 
cameras. Because of the small sensor size, they suffer from 
reduced light collection ability. In contrast, in our visible 
prototype below, we used a 6.7mm wide square sensor, which 
increases the amount of light collection by about three orders 
of magnitude, while the device thickness remains 
approximately similar (500 micron). 

Lensfree microscopy and shadow imaging. Lensfree cameras 
have been successfully demonstrated for several microscopy 
and lab-on-chip applications, wherein the subject to be imaged 
is close to the image sensor. An on-chip, lens-free microscopy 
design that uses amplitude masks to cast a shadow of point 
illumination sources onto a microscopic tissue sample has 
shown significant promise for microscopy and related 
applications, where the sample being imaged is very close to the 
sensor (less than 1 mm). Unfortunately, this 
technique cannot be directly extended to traditional photography 
and other applications that require larger standoff distances 
and do not provide control over illumination. 
Fig. 2: Comparison of pinhole, lens-based, and coded mask-based cameras. Pinhole cameras and lens-based cameras provide one-to-one 
mapping between light from a focal plane and the sensor plane (note that light from three different directions is mapped to three distinct 
locations on the sensor), but the coded mask-based cameras provide a multiplexed image that must be resolved using computation. The table 
highlights some salient properties of the three camera designs. Pinhole cameras suffer from very low light throughput, while lens-based 
cameras are bulky and rigid because of their optics. In contrast, the FlatCam design offers thin, light-efficient cameras with the potential for 
direct fabrication. 


III. FlatCam design 

Our FlatCam design places an amplitude mask almost 
immediately in front of the sensor array (see Fig. 1). We 
assume that the sensor and the mask are planar, parallel to 
each other, and separated by distance d. While we focus on 
a single mask for exposition purposes, the concept extends 
to multiple amplitude masks in a straightforward manner. For 
simplicity of explanation, we also assume (without loss of 
generality) that the mask modulates the impinging light in a 
binary fashion; that is, it consists of transparent features that 
transmit light and opaque features that block light. We denote 
the size of the transparent/opaque features by $\Delta$ and assume 
that the mask covers the entire sensor array. 

Consider the one-dimensional coded aperture system 
depicted in Fig. 3, in which a single coded mask is placed at 
distance d from the sensor plane. We assume that the 
field-of-view (FOV) of each sensor pixel is limited by a chief ray angle 
(CRA) $\theta_{\mathrm{CRA}}$, which implies that every pixel receives light only 
from the angular directions that lie within $\pm\theta_{\mathrm{CRA}}$ with respect 
to its surface normal. Therefore, light rays entering any pixel 
are modulated by the mask pattern of length $w = 2d\tan\theta_{\mathrm{CRA}}$. 
As we increase (or decrease) the mask-to-sensor distance, d, 
the width of the mask pattern, w, also increases (or decreases). 
Assuming that the scene is far from the camera, the mask 
patterns for neighboring pixels shift by the same amount as 
the pixel width. Therefore, if we assume that the mask features 
and the sensor pixels have the same width, $\Delta$, then the mask 
patterns for neighboring pixels shift by exactly one mask 
element. Furthermore, if we fix $d \approx N\Delta/(2\tan\theta_{\mathrm{CRA}})$, then 
exactly N mask features lie within the FOV of each pixel. If 
the mask is designed by repeating a pattern of N features, then 
the linear system that maps the light distribution in the scene 
to the sensor measurements can be represented as a cyclic 
convolution. 
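
The geometric relations above can be checked numerically. The following is a minimal sketch, not the authors' code; the function names and the example values (30 micron features, a 20 degree CRA, N = 127) are assumptions chosen only for illustration.

```python
import math


def mask_window_width(d, theta_cra_deg):
    """Width w = 2 d tan(theta_CRA) of the mask region seen by one pixel."""
    return 2.0 * d * math.tan(math.radians(theta_cra_deg))


def features_in_fov(d, theta_cra_deg, delta):
    """Number of mask features of width delta inside one pixel's FOV."""
    return mask_window_width(d, theta_cra_deg) / delta


def distance_for_n_features(n, theta_cra_deg, delta):
    """Mask-to-sensor distance d = N delta / (2 tan(theta_CRA))."""
    return n * delta / (2.0 * math.tan(math.radians(theta_cra_deg)))


# Hypothetical numbers, not taken from the prototypes.
d = distance_for_n_features(127, 20.0, 30e-6)
print(f"d = {d * 1e3:.2f} mm, "
      f"features in FOV = {features_in_fov(d, 20.0, 30e-6):.1f}")
```

Under these assumptions, choosing d from the formula places exactly N features in each pixel's field of view, which is the condition for the cyclic-convolution model.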

Fig. 3: An illustration of a coded aperture system with a mask 
placed d units away from the sensor plane. Each pixel records light 
from angular directions within $\pm\theta_{\mathrm{CRA}}$. Light reaching each sensor pixel is 
modulated by the mask pattern that is $w = 2d\tan\theta_{\mathrm{CRA}}$ units wide, 
which we can increase (or decrease) by moving the mask farther from (or 
closer to) the sensor. 

A number of mask patterns have been introduced in the 
literature that offer high light collection and robust image 
reconstruction for circulant systems. Typical examples include 
the uniform redundant array (URA), modified URA (MURA), 
and pseudo-noise patterns such as maximum length sequences 
(MLS or M-sequences). One key property 
of these patterns is that they have a near-flat Fourier spectrum. 
Furthermore, coded aperture systems have been conventionally 
used for imaging X-rays and gamma rays, for which diffraction 
effects can be ignored. Therefore, if we can align the mask 
according to the ideal configuration and ignore the diffraction 
effects, then the system response is a cyclic convolution. 

Our FlatCam design does not necessarily yield a circulant 
system because our goal is to use the smallest possible mask- 
to-sensor distance, d. Furthermore, to build our prototypes, 
we used the smallest mask feature size, $\Delta$, for which we 
could manually align the system on the optical table using 
mechanical posts. However, we demonstrate that by employing 
a scalable calibration procedure with a separable mask pattern 
in our FlatCam design, we can reconstruct quality images via 
simple computational algorithms. 
A. Replacing lenses with computation 

Light from all points in the three-dimensional scene gets 
modulated and diffracted by the mask pattern and recorded on 
the image sensor. Let us consider a surface, S, in the scene 
that is completely visible to the sensor pixels and denote x as 
a vector of light distribution from all the points in S. We can 
then describe the sensor measurements, y, as 

$$y = \Phi x + e. \tag{1}$$

$\Phi$ denotes a transfer matrix whose ith column corresponds to 
an image that would form on the sensor if the scene contains 
a single light source of unit intensity at the ith location in x. 
e denotes the sensor noise and any model mismatch. Note 
that if all the points in the scene, x, are at the same depth, 
then S becomes a plane parallel to the mask at distance d. 
Since the sensor pixels do not have a one-to-one mapping with 
the scene pixels, the matrix $\Phi$ will not resemble the identity 
matrix. Instead, each sensor pixel measures multiplexed light 
from multiple scene pixels, and each row of $\Phi$ indicates how 
strongly each scene pixel contributes to the intensity measured 
at a particular sensor pixel. In other words, any column in $\Phi$ 
denotes the image formed on the sensor if the scene contains 
a single, point light source at the respective location. 
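
To make the column interpretation of the transfer matrix concrete, here is a small sketch of the forward model in (1). The matrix is a random stand-in rather than a physically simulated one, and the sizes and names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_sensor, num_scene = 6, 4                 # toy sizes (assumed)
phi = rng.random((num_sensor, num_scene))    # stand-in transfer matrix


def sensor_image(x, noise_std=0.0):
    """Forward model y = Phi x + e with optional Gaussian noise e."""
    return phi @ x + noise_std * rng.standard_normal(num_sensor)


# A single unit-intensity point source at scene location 2 ...
point = np.zeros(num_scene)
point[2] = 1.0
y_point = sensor_image(point)    # ... reproduces column 2 of Phi.

# Two sources superimpose linearly on the sensor.
two = np.zeros(num_scene)
two[0], two[3] = 1.0, 0.5
y_two = sensor_image(two)
```

Each column of the stand-in matrix plays the role of the sensor image of one point source, and any scene is a weighted sum of those columns.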

Multiplexing generally results in an ill-conditioned system. 
Our goal is to design a mask that produces a matrix $\Phi$ 
that is well conditioned and hence can be stably inverted 
without excessive noise amplification. We now discuss how 
we navigate among three inter-related design decisions: the 
mask pattern, the placement d and feature size $\Delta$ of the mask, 
and the image recovery (demultiplexing) algorithm. 

B. Mask pattern 

The design of mask patterns plays an important role in 
coded-aperture imaging. An ideal pattern would maximize the 
light throughput while providing a well-conditioned 
scene-to-sensor transfer function. In this regard, notable examples 
of mask patterns include URA, MURA, and pseudo-noise 
patterns. URAs are particularly useful because of 
two key properties: (1) almost half of the mask is open, which 
helps with the signal-to-noise ratio; (2) the autocorrelation 
function of the mask is close to a delta function, which helps 
in image reconstruction. URA patterns are closely related to 
the Hadamard-Walsh pattern and the MLS that are maximally 
incoherent with their own cyclic shifts [35], [36]. 
However, these properties only hold true if the mask and sensor 
can be aligned to create a perfect circulant system. 

In the FlatCam design, we consider three parameters to select the 
mask pattern: the light throughput, the complexity of system 
calibration and inversion, and the conditioning of the resulting 
multiplexing matrix $\Phi$. 

Light throughput. In the absence of the mask, the amount of 
light that can be sensed by the bare sensor is limited only by 
its CRA. Since the photosensitive element in a CMOS/CCD 
sensor array is situated in a small cavity, a micro-lens array 
directly on top of the sensor is used to increase the light 
collection efficiency. In spite of this, only light rays up to 
a certain angle of incidence reach the sensor, and this is 
the fundamental light collection limit of that sensor. Placing 
an amplitude-modulating mask very close to (and completely 
covering) the sensor results in a light-collection efficiency that 
is a fraction of the fundamental light collection limit of the 
sensor. In our designs, half of the binary mask features are 
transparent, which halves our light collection ability compared 
to the maximum limit. 

To compare mask patterns with different types of transparent 
features, we present a simulation result in Fig. 4(a). We 
simulated the transfer matrix, $\Phi$, for a one-dimensional system for 
four different types of masks and compared the singular values 
of their respective $\Phi$. Ideally, we want a mask for which the 
singular values of $\Phi$ are large and decay at a slow rate. We 
generated one-dimensional mask patterns using random binary 
patterns with 50% and 75% open features, a uniform random 
pattern with entries drawn from the unit interval, [0,1], and 
an MLS pattern with 50% open features. We observed that 
the MLS pattern outperforms random patterns and that increasing the 
number of transparent features beyond 50% deteriorates the 
conditioning of the system. 
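
A rough version of this comparison can be sketched under a strongly idealized assumption: a purely circulant system with no diffraction blur and no CRA limit, so the numbers will not match the simulation in the figure. The LFSR tap table, the sequence length of 127, and the random seed are assumptions for illustration.

```python
import numpy as np


def m_sequence(m):
    """0/1 maximum length sequence of length 2**m - 1 from an LFSR.

    The table lists t for primitive trinomials x^m + x^t + 1.
    """
    taps = {3: 1, 4: 1, 5: 2, 6: 1, 7: 1}
    t = taps[m]
    seq = [1] * m  # any nonzero seed works
    for n in range(m, 2**m - 1):
        seq.append(seq[n - m] ^ seq[n - m + t])
    return np.array(seq, dtype=float)


def circulant_condition_number(pattern):
    """Condition number of the circulant matrix built from cyclic shifts."""
    rows = np.stack([np.roll(pattern, k) for k in range(len(pattern))])
    s = np.linalg.svd(rows, compute_uv=False)
    return s[0] / s[-1]


rng = np.random.default_rng(0)
n = 127
masks = {
    "MLS, 50% open": m_sequence(7),
    "random binary, 50% open": (rng.random(n) < 0.5).astype(float),
    "random binary, 75% open": (rng.random(n) < 0.75).astype(float),
    "uniform in [0, 1]": rng.random(n),
}
conds = {name: circulant_condition_number(p) for name, p in masks.items()}
for name, c in conds.items():
    print(f"{name:>24s}: condition number {c:.1f}")
```

In this idealized setting the 0/1 M-sequence has a flat spectrum apart from its DC term, so its condition number is sqrt(N + 1), while a random pattern typically has at least one small Fourier coefficient and is therefore worse conditioned.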

As described above, while it is true that the light collection 
ability of our FlatCam design is one-half of the maximum 
achievable with a particular sensor, the main advantage of 
the FlatCam design is that it allows us to use much larger 
sensor arrays for a given device thickness constraint, thereby 
significantly increasing the light collection capabilities of 
devices under thickness constraints. 

Computational complexity. The (linear) relationship 
between the scene irradiance x and the sensor measurements 
y is contained in the multiplexing matrix $\Phi$. Discretizing the 
unknown scene irradiance into N x N pixel units and assuming 
an M x M sensor array, $\Phi$ is an $M^2 \times N^2$ matrix. Given 
a mask and sensor, we can obtain the entries of $\Phi$ either by 
modeling the transmission of light from the scene to the sensor 
or through a calibration process. Clearly, even for moderately 
sized systems, $\Phi$ is prohibitively large to either estimate 
(calibration) or invert (image reconstruction), in general. For 
example, to describe a system with a megapixel resolution 
scene and a megapixel sensor array, $\Phi$ will contain on the 
order of $10^6 \times 10^6 = 10^{12}$ elements. 

One way to reduce the complexity of $\Phi$ is to use a separable 
mask for the FlatCam system. If the mask pattern is separable 
(i.e., an outer product of two one-dimensional patterns), then 
the imaging system in (1) can be rewritten as 

$$Y = \Phi_L X \Phi_R^T + E, \tag{2}$$

where $\Phi_L$, $\Phi_R$ denote matrices that correspond to 
one-dimensional convolution along the rows and columns of the 
scene, respectively, X is an N x N matrix containing the 
scene radiance, Y is an M x M matrix containing the sensor 
measurements, and E denotes the sensor noise and any model 
mismatch. For a megapixel scene and a megapixel sensor, 
$\Phi_L$ and $\Phi_R$ have only $10^6$ elements each, as opposed to 
$10^{12}$ elements in $\Phi$. A similar idea has recently been proposed 
elsewhere with the design of a doubly Toeplitz mask. In our 
implementation, we also estimate the system matrices using 
a separate calibration procedure (see Sec. III-D), which also 
becomes significantly simpler for a separable system. 
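
The saving from separability can be illustrated with a small sketch that checks the separable model against the equivalent full matrix acting on the vectorized scene. The full matrix is formed here as a Kronecker product, which is a standard identity and not something taken from the paper; the toy sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 5, 4                          # toy sensor and scene sizes (assumed)
phi_l = rng.random((M, N))           # stand-in left matrix
phi_r = rng.random((M, N))           # stand-in right matrix
X = rng.random((N, N))               # scene radiance

# Separable model: two small matrix products.
Y = phi_l @ X @ phi_r.T

# Equivalent non-separable form: vec(Y) = (Phi_R kron Phi_L) vec(X),
# using column-major vectorization.
phi_full = np.kron(phi_r, phi_l)     # shape (M*M, N*N)
y_vec = phi_full @ X.flatten(order="F")


def element_counts(m, n):
    """Entries in the full matrix versus the two separable factors."""
    return (m * m) * (n * n), 2 * m * n


print(element_counts(1000, 1000))
```

For a megapixel scene and sensor (M = N = 1000), the count is 10^12 entries for the full matrix against 2 x 10^6 for the two factors combined.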
To compare separable and non-separable mask patterns, we 
present a simulation result in Fig. 4(b). We simulated the $\Phi$ 
matrices for a two-dimensional scene at 64 x 64 resolution 
using two separable and two non-separable mask patterns and 
compared the singular values of their respective $\Phi$. For 
non-separable mask patterns, we generated a random binary 2D 
pattern with an equal number of 0, 1 entries and a uniform 2D 
pattern with entries drawn uniformly from the unit interval. 
For separable mask patterns, we generated an MLS pattern 
by first computing an outer product of two one-dimensional 
M-sequences with ±1 entries and setting all -1s to zero, and 
a separable binary pattern by computing the outer product of 
two one-dimensional binary patterns so that the number of 0s 
and 1s in the resulting 2D pattern is almost the same. Even 
though the non-separable binary pattern provides better 
singular values compared to a separable MLS pattern, calibrating 
and characterizing such a system for high-dimensional images 
is beyond our current capabilities. 

Conditioning. The mask pattern should be chosen to make 
the multiplexing matrices $\Phi_L$ and $\Phi_R$ as numerically stable as 
possible, which ensures a stable recovery of the image X from 
the sensor measurements Y. Such $\Phi_L$ and $\Phi_R$ should have 
low condition numbers, i.e., a flat singular value spectrum. 
For Toeplitz matrices, it is well known that, of all binary 
sequences, the so-called maximum length sequences, or 
M-sequences, have maximally flat spectral properties. 
Therefore, we use a separable mask pattern that is the outer product 
of two one-dimensional M-sequence patterns. However, 
because of the inevitable non-idealities in our implementation, 
such as the limited sensor CRA and the larger than optimal 
sensor-mask distance due to the hot mirror, the actual $\Phi_L$ and 
$\Phi_R$ we obtain using a separable M-sequence based mask do 
not achieve a perfectly flat spectral profile. Nevertheless, as 
we demonstrate in our prototypes, the resulting multiplexing 
matrices enable stable image reconstruction in the presence of 
sensor noise and other non-idealities. All of the 
visible-wavelength, color image results shown in this paper were obtained 
using normal, indoor ambient lighting and exposure times in the 
10-20ms range, demonstrating that robust reconstruction is 
possible. 


C. Mask placement and feature size 

The multiplexing matrices $\Phi_L$, $\Phi_R$ describe the mapping of 
light emanating from the points in the scene to the pixels on 
the sensor. Consider light from a point source passing through 
one of the mask openings; its intensity distribution recorded 
at the sensor forms a point-spread function (PSF) that is due 
to both diffraction and geometric blurs. The PSF acts as a 
low-pass filter that limits the frequency content that can be 
recovered from the sensor measurements. The choice of the 
feature size and mask placement is dictated by the tradeoff 
between two factors: reducing the size of the PSF to minimize 
the total blur and enabling sufficient multiplexing to obtain a 
well-conditioned linear system. 

The total size of the PSF depends on the diffraction and 
geometric blurs, which in turn depend on the distance between 
the sensor and the mask, d, and the mask feature size, $\Delta$. 
The size of the diffraction blur is approximately $2.44\lambda d/\Delta$, 
where $\lambda$ is the wavelength of light waves. The size of the 
geometric blur, however, is equal to the feature size $\Delta$. Thus, 
the minimum blur radius for a fixed d is achieved when the 
two blur sizes are approximately equal: $\Delta = \sqrt{2.44\lambda d}$. One 
possible way to reduce the size of the combined PSF is to 
use a larger feature size $\Delta$. However, the extent of multiplexing 
within the scene pixels reduces as $\Delta$ increases. Therefore, if 
we aim to keep the amount of multiplexing constant, then the 
mask feature size $\Delta$ should shrink proportionally to the 
mask-sensor distance d. 

In practice, physical limits on the sensor-mask distance d 
or the mask feature size $\Delta$ can dictate the design choices. In 
our visible FlatCam prototype, for example, we use a Sony 
ICX285 sensor. The sensor has a 0.5 mm thick hot mirror 
attached to the top of the sensor, which restricts the potential 
spacing between the mask and sensor surface. Therefore, we 
attach the mask to the hot mirror, resulting in $d \approx 500\,\mu$m 
(distance between the mask and the sensor surface). For a 
single pinhole at this distance, we achieve the smallest total 
blur size using a mask feature size of approximately $\Delta = 
30\,\mu$m, which is also the smallest feature size for which we 
were able to properly align the mask and sensor on the optical 
table. Of course, in future implementations, where the mask 
pattern is directly etched on top of the image sensor (direct 
fabrication) such practical constraints can be alleviated and 
we can achieve much higher resolution images by moving the 
mask closer to the sensor and reducing the mask feature size 
proportionally. 
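
The blur trade-off for the visible prototype can be worked through numerically. This is a sketch under one stated assumption: a representative visible wavelength of 550 nm, which the text does not specify. The function names are likewise illustrative.

```python
import math


def diffraction_blur(wavelength, d, delta):
    """Approximate diffraction blur size 2.44 * lambda * d / delta."""
    return 2.44 * wavelength * d / delta


def balanced_feature_size(wavelength, d):
    """Feature size at which diffraction and geometric blur are equal."""
    return math.sqrt(2.44 * wavelength * d)


wavelength = 550e-9   # assumed mid-visible wavelength, in meters
d = 500e-6            # mask-to-sensor distance of the visible prototype

delta = balanced_feature_size(wavelength, d)
print(f"balanced feature size: {delta * 1e6:.1f} um")
print(f"diffraction blur at 30 um features: "
      f"{diffraction_blur(wavelength, d, 30e-6) * 1e6:.1f} um")
```

With these inputs the balanced feature size comes out at roughly 26 microns, the same order as the 30 micron features used in the prototype.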

To compare the effect of feature size on the conditioning of 
the sensing matrix, we present a simulation result in Fig. 4(c). 
We simulated the $\Phi$ matrices for an MLS mask and a single 
pinhole for three different values of $\Delta \in \{5, 10, 30\}\,\mu$m. For 
a pinhole pattern, we observe that reducing the pinhole size, 
$\Delta$, degrades the conditioning of $\Phi$ in two ways: (1) The largest 
singular value of $\Phi$ reduces as less light passes through the 
pinhole. (2) The singular values decay faster as the total blur 
increases because of smaller pinholes. For an MLS pattern, 
we observed that reducing the feature size, $\Delta$, improves the 
conditioning of the system matrix $\Phi$. However, because of the 
practical challenges we encountered while aligning mask and 
sensor, we do not yet have experimental evidence to support 
the simulation results. 


D. Camera calibration 

We now provide the details of our calibration procedure 
for the separable imaging system modeled in (2). Instead of 
modeling the convolution shifts and diffraction effects for a 
particular mask-sensor arrangement, we directly estimate the 
system matrices. 

To align the mask and sensor, we adjust their relative 
orientation such that a separable scene in front of the camera 
yields a separable image on the sensor. For a coarse alignment, 
we use a point light source, which projects a shadow of the 
mask onto the sensor, and align the horizontal and vertical 
edges on the sensor image with the image axes. For a fine 
alignment, we align the sensor with the mask while projecting 
horizontal and vertical stripes on a monitor or screen in 
front of the camera. 

Fig. 4: Analysis of coded aperture system singular values by simulating the sensing matrix with different masks placed at $d = 500\,\mu$m. (a) 
Increasing the number of transparent features beyond 50% may increase light collection, but it deteriorates the conditioning of the system. 
(b) Even though a non-separable pattern may provide better reconstruction compared to a separable pattern, calibrating and characterizing 
such a system requires a highly sophisticated calibration procedure. (c) A single pinhole-based system degrades as we reduce the feature 
size because less light reaches the sensor and the blur size increases. In contrast, a coded mask-based system improves as we reduce the 
feature size. 

To calibrate a system that can recover N x N images X, we 
estimate the left and right matrices $\Phi_L$, $\Phi_R$ using the sensor 
measurements of 2N known calibration patterns projected on 
a screen as depicted in Fig. 5. Our calibration procedure relies 
on an important observation. If the scene X is separable, i.e., 
$X = ab^T$ where $a, b \in \mathbb{R}^N$, then 

$$Y = \Phi_L a b^T \Phi_R^T = (\Phi_L a)(\Phi_R b)^T.$$

In essence, the image formed on the sensor is a rank-1 matrix, 
and by using a truncated singular value decomposition (SVD), 
we can obtain $\Phi_L a$ and $\Phi_R b$ up to a signed, scalar constant. 
We take N separable pattern measurements for calibrating 
each of $\Phi_L$ and $\Phi_R$. 

Specifically, to calibrate $\Phi_L$, we capture N images 
$\{Y_1, \ldots, Y_N\}$ corresponding to the separable patterns 
$\{X_1, \ldots, X_N\}$ displayed on a monitor or screen. Each $X_k$ is 
of the form $X_k = h_k \mathbf{1}^T$, where $h_k \in \mathbb{R}^N$ is a column of the 
orthogonal Hadamard matrix H of size N x N and $\mathbf{1}$ is an 
all-ones vector of length N. Since the Hadamard matrix consists 
of ±1 entries, we record two images for each Hadamard 
pattern: one with $h_k \mathbf{1}^T$ and one with $-h_k \mathbf{1}^T$, while setting 
the negative entries to zero in both cases. We then subtract the 
two sensor images to obtain the measurements corresponding 
to $X_k$. Let $\tilde{Y}_k = u_k v^T$ be the rank-1 approximation of the 
measurements $Y_k$ obtained via SVD, where the underlying 
assumption is that $v \approx \Phi_R \mathbf{1}$ up to a signed, scalar constant. 
Then, we have 

$$[u_1 \; u_2 \; \cdots \; u_N] = \Phi_L [h_1 \; h_2 \; \cdots \; h_N] = \Phi_L H, \tag{3}$$

and we compute $\Phi_L$ as 

$$\Phi_L = [u_1 \; u_2 \; \cdots \; u_N] H^{-1}, \tag{4}$$

where $H^{-1} = \frac{1}{N} H^T$. Similarly, we estimate $\Phi_R$ by projecting 
N patterns of the form $\mathbf{1} h_k^T$. 
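
The calibration of the left matrix can be simulated end to end. The sketch below is an assumption-laden illustration rather than the authors' procedure: the camera is replaced by a noise-free separable model with random stand-in matrices, the Hadamard matrix uses the Sylvester construction (so N must be a power of two), and the sign ambiguity of each SVD is resolved by aligning every right singular vector with the first one.

```python
import numpy as np


def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
    return H


def calibrate_left(capture, n):
    """Estimate Phi_L (up to one global scalar) from n Hadamard patterns.

    `capture(X)` is a hypothetical function returning the sensor image
    of a non-negative displayed pattern X.
    """
    H = hadamard(n)
    ones = np.ones(n)
    cols, v_ref = [], None
    for k in range(n):
        h = H[:, k]
        # Display the positive and negative parts separately, then subtract.
        y = (capture(np.outer(np.maximum(h, 0.0), ones))
             - capture(np.outer(np.maximum(-h, 0.0), ones)))
        u, s, vt = np.linalg.svd(y)
        u_k, v_k = s[0] * u[:, 0], vt[0]
        if v_ref is None:
            v_ref = v_k
        elif v_k @ v_ref < 0:      # make the SVD sign consistent across k
            u_k = -u_k
        cols.append(u_k)
    U = np.column_stack(cols)
    return U @ H.T / n             # U H^{-1}, since H^{-1} = H^T / n


# Simulated camera with stand-in matrices (assumed, noise-free).
rng = np.random.default_rng(2)
M, N = 12, 8
phi_l, phi_r = rng.random((M, N)), rng.random((M, N))
capture = lambda X: phi_l @ X @ phi_r.T

phi_l_est = calibrate_left(capture, N)
scale = np.sum(phi_l_est * phi_l) / np.sum(phi_l ** 2)
```

In this simulation the estimate equals the true left matrix times a single signed scalar, which matches the ambiguity described in the text. The right matrix would be estimated the same way with the transposed patterns.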

Figure 5 depicts the calibration procedure in which we 
projected separable patterns on a screen and recorded sensor 
measurements; the sensor measurements recorded from these 
patterns are re-ordered to form the left and right multiplexing 
operators shown in Fig. 5(b). 

Fig. 5: Calibration for measuring the left and right multiplexing 
matrices $\Phi_L$ and $\Phi_R$ corresponding to a separable mask. (a) Display 
separable patterns on a screen. The patterns are orthogonal, 
one-dimensional Hadamard codes that are repeated along either the 
horizontal or vertical direction. (b) Estimated left and right system 
matrices. 

A mask can only modulate light with non-negative 
transmittance values. M-sequences are defined in terms of ±1 
values and hence cannot be directly implemented in a mask. 
The masks we use in our prototype cameras are constructed 
by computing the outer product of two M-sequences and 
then setting the resulting -1 entries to 0. This produces a 
mask that is optically feasible but no longer mathematically 
separable. We can resolve this issue in post-processing, since 
the difference between the measurements using the theoretical 
±1 separable mask and the optically feasible 0/1 mask is 
simply a constant bias term. In practice, once we acquire a 
sensor image, we correct it to correspond to a ±1 separable 
mask (described as Y in (2)) simply by forcing the column 
and row sums to zero. 

Recall that if we use a separable mask, then we can describe
sensor measurements as $Y = \Phi_L X \Phi_R^T$. If we turn on a single
pixel in $X$, say $X_{ij} = 1$, then the sensor measurements would
be a rank-1 matrix $\phi_i \phi_j^T$, where $\phi_i$, $\phi_j$ denote the $i$th and $j$th
columns in $\Phi_L$, $\Phi_R$, respectively. Let us denote $\psi$ as a
one-dimensional M-sequence of length $N$ and $\Psi_{\pm 1} = \psi \psi^T$ as
the separable mask that consists of ±1 entries; it is optically
infeasible because we cannot subtract light intensities. We
created an optically feasible mask by setting all the −1s in
$\Psi_{\pm 1}$ to 0s, which can be described as

$\Psi_{0/1} = (\Psi_{\pm 1} + \mathbf{1}\mathbf{1}^T)/2$.

Therefore, if we have a single point source in the scene,
the sensor image will be a rank-2 matrix. By subtracting
the row and column means of the sensor image, we can
convert the sensor response back to a rank-1 matrix. Only after
this correction can we represent the superposition of sensor
measurements from all the light sources in the scene using
the separable system $Y = \Phi_L X \Phi_R^T$. We present an example
of this mean subtraction procedure for an image captured with
our prototype and a point light source in Fig. 6.

[Figure 6 panels: singular components 1 to 3 of the point spread
function, labeled σ1 = 0.704, σ2 = 0.079, σ3 = 0.044]

Fig. 6: Singular value decomposition of the point spread function for
our proposed separable MLS mask before and after mean subtraction.
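The rank argument above can be checked numerically. The following sketch is an illustration under simplifying assumptions: it uses a short length-7 M-sequence in place of the prototype's sequence, treats the mask pattern itself as the point-source response (ignoring magnification, cropping, and noise), and the helper name `remove_row_col_means` is hypothetical.

```python
import numpy as np

# Length-7 M-sequence in +/-1 form (illustrative stand-in for the real mask code).
psi = np.array([1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0])
Psi_pm = np.outer(psi, psi)            # ideal +/-1 separable mask
Psi_01 = (Psi_pm + 1.0) / 2.0          # optically feasible 0/1 mask

def remove_row_col_means(A):
    """Force every row sum and every column sum of A to zero."""
    A = A - A.mean(axis=1, keepdims=True)
    return A - A.mean(axis=0, keepdims=True)

def rank(A):
    return int(np.linalg.matrix_rank(A, tol=1e-9))

corrected = remove_row_col_means(Psi_01)
ranks = (rank(Psi_pm), rank(Psi_01), rank(corrected))

# After the correction the response is again an outer product of one vector.
centered = psi - psi.mean()
is_separable = np.allclose(corrected, 0.5 * np.outer(centered, centered))
```

Here the 0/1 mask has rank 2 because of the added constant term, and removing the row and column means leaves a rank-1 matrix proportional to the outer product of the mean-removed sequence with itself.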

IV. Image reconstruction 

Given a set of $M \times M$ sensor measurements $Y$, our ability
to invert the system (2) to recover the desired $N \times N$ image
$X$ primarily depends on the rank and the condition number of
the system matrices $\Phi_L$, $\Phi_R$.

If both $\Phi_L$ and $\Phi_R$ are well-conditioned, then we can
estimate $X$ by solving a simple least-squares problem

$\hat{X}_{LS} = \arg\min_X \|\Phi_L X \Phi_R^T - Y\|_2^2$, (5)

which has the closed form solution: $\hat{X}_{LS} = \Phi_L^+ Y (\Phi_R^+)^T$, where
$\Phi_L^+$ and $\Phi_R^+$ denote the pseudoinverses of $\Phi_L$ and $\Phi_R$,
respectively. Consider the SVD of $\Phi_L = U_L \Sigma_L V_L^T$, where $U_L$
and $V_L$ are orthogonal matrices that contain the left and right
singular vectors and $\Sigma_L$ is a diagonal matrix that contains the
singular values. Note that this SVD need only be computed
once for each calibrated system. The pseudoinverse can then
be efficiently pre-computed as $\Phi_L^+ = V_L \Sigma_L^{-1} U_L^T$.
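As a minimal sketch of the closed-form least-squares solution, the snippet below builds the pseudoinverses from thin SVDs and applies them to noiseless synthetic measurements. The matrices are random stand-ins for a calibrated system, the helper name `pinv_from_svd` is hypothetical, and full column rank is assumed.

```python
import numpy as np

def pinv_from_svd(Phi):
    """Pseudoinverse V diag(1/s) U^T from a thin SVD (assumes full column rank)."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    return (Vt.T / s) @ U.T

# Synthetic, well-conditioned system (assumption): M >= N with Gaussian entries.
rng = np.random.default_rng(0)
M, N = 40, 32
Phi_L, Phi_R = rng.standard_normal((M, N)), rng.standard_normal((M, N))
X_true = rng.random((N, N))
Y = Phi_L @ X_true @ Phi_R.T                      # noiseless separable measurements

# The two pseudoinverses are computed once and reused for every sensor image.
PL_pinv, PR_pinv = pinv_from_svd(Phi_L), pinv_from_svd(Phi_R)
X_ls = PL_pinv @ Y @ PR_pinv.T                    # closed-form solution of (5)
```

With noiseless data and well-conditioned matrices, `X_ls` reproduces `X_true` to numerical precision; the regularized variant discussed next is the one to prefer when small singular values are present.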

When the matrices $\Phi_L$, $\Phi_R$ are not well-conditioned or are
under-determined (e.g., when we have fewer measurements
$M$ than the desired dimensionality of the scene $N$, as in
compressive sensing [15]-[17]), some of the singular values
are either very small or equal to zero. In these cases, the
least-squares estimate $\hat{X}_{LS}$ suffers from noise amplification.
A simple approach to reduce noise amplification is to add an
$\ell_2$ regularization term in the least-squares problem in (5):

$\hat{X}_{Tik} = \arg\min_X \|\Phi_L X \Phi_R^T - Y\|_2^2 + \tau \|X\|_2^2$, (6)

where $\tau > 0$ is a regularization parameter. The solution of
(6) can also be explicitly written using the SVD of $\Phi_L$ and $\Phi_R$
as we describe below.

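The solution of (6) can be computed by setting the gradient
of the objective in (6) equal to zero and simplifying the
resulting equation:

$\Phi_L^T (\Phi_L X \Phi_R^T - Y) \Phi_R + \tau X = 0$
$\Phi_L^T \Phi_L X \Phi_R^T \Phi_R + \tau X = \Phi_L^T Y \Phi_R$.

Replacing $\Phi_L$ and $\Phi_R$ with their SVDs yields

$V_L \Sigma_L^2 V_L^T X V_R \Sigma_R^2 V_R^T + \tau X = V_L \Sigma_L U_L^T Y U_R \Sigma_R V_R^T$.

Multiplying both sides of the equation with $V_L^T$ from the left
and $V_R$ from the right yields

$\Sigma_L^2 V_L^T X V_R \Sigma_R^2 + \tau V_L^T X V_R = \Sigma_L U_L^T Y U_R \Sigma_R$.

Denote the diagonal entries of $\Sigma_L^2$ and $\Sigma_R^2$ using the vectors
$\sigma_L$ and $\sigma_R$, respectively, to simplify the equations to

$V_L^T X V_R \odot (\sigma_L \sigma_R^T) + \tau V_L^T X V_R = \Sigma_L U_L^T Y U_R \Sigma_R$
$V_L^T X V_R \odot (\sigma_L \sigma_R^T + \tau \mathbf{1}\mathbf{1}^T) = \Sigma_L U_L^T Y U_R \Sigma_R$
$V_L^T X V_R = (\Sigma_L U_L^T Y U_R \Sigma_R) ./ (\sigma_L \sigma_R^T + \tau \mathbf{1}\mathbf{1}^T)$,

where $A \odot B$ and $A ./ B$ denote element-wise multiplication
and division of matrices $A$ and $B$, respectively. The solution
of (6) can finally be written as

$\hat{X}_{Tik} = V_L [(\Sigma_L U_L^T Y U_R \Sigma_R) ./ (\sigma_L \sigma_R^T + \tau \mathbf{1}\mathbf{1}^T)] V_R^T$. (7)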

Thus, once the SVDs of $\Phi_L$ and $\Phi_R$ are computed and stored,
reconstruction of an $N \times N$ image from $M \times M$ sensor
measurements involves a fixed cost of two $M \times N$ matrix
multiplications, two $N \times N$ matrix multiplications, and three
$N \times N$ diagonal matrix multiplications.
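The closed-form expression (7) translates almost line by line into array code. The sketch below is an illustration rather than the authors' MATLAB implementation: the function names `precompute` and `tikhonov` are hypothetical, the system matrices are random and deliberately under-determined, and the value of `tau` is arbitrary. The final lines evaluate the zero-gradient condition from the derivation as a sanity check.

```python
import numpy as np

def precompute(Phi_L, Phi_R):
    """Thin SVDs of both system matrices; done once per calibrated camera."""
    UL, sL, VLt = np.linalg.svd(Phi_L, full_matrices=False)
    UR, sR, VRt = np.linalg.svd(Phi_R, full_matrices=False)
    return UL, sL, VLt.T, UR, sR, VRt.T

def tikhonov(Y, svds, tau):
    """Evaluate eq. (7) for one sensor image Y."""
    UL, sL, VL, UR, sR, VR = svds
    num = sL[:, None] * (UL.T @ Y @ UR) * sR[None, :]   # Sigma_L U_L^T Y U_R Sigma_R
    den = np.outer(sL**2, sR**2) + tau                  # sigma_L sigma_R^T + tau 1 1^T
    return VL @ (num / den) @ VR.T

# Synthetic under-determined example (assumption): M < N, noisy measurements.
rng = np.random.default_rng(0)
M, N, tau = 24, 32, 1e-2
Phi_L, Phi_R = rng.standard_normal((M, N)), rng.standard_normal((M, N))
Y = Phi_L @ rng.random((N, N)) @ Phi_R.T + 0.01 * rng.standard_normal((M, M))
X_tik = tikhonov(Y, precompute(Phi_L, Phi_R), tau)

# Gradient of the objective in (6), up to a factor of 2; it should vanish.
grad = Phi_L.T @ (Phi_L @ X_tik @ Phi_R.T - Y) @ Phi_R + tau * X_tik
```

Scaling by the singular values is done with broadcasting instead of explicit diagonal matrices, which corresponds to the diagonal multiplications counted in the text.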

In many cases, exploiting the sparse or low-dimensional
structure of the unknown image significantly enhances
reconstruction performance. Natural images and videos exhibit
a host of geometric properties, including sparse gradients
and sparse coefficients in certain transform domains. Wavelet
sparse models and total variation (TV) are widely used
regularization methods for natural images [37], [38]. By enforcing
these geometric properties, we can suppress noise amplification
as well as obtain unique solutions. A pertinent example
for image reconstruction is the sparse gradient model, which
can be represented in the form of the following total-variation
(TV) minimization problem:

$\hat{X}_{TV} = \arg\min_X \|\Phi_L X \Phi_R^T - Y\|_2^2 + \lambda \|X\|_{TV}$. (8)

The term $\|X\|_{TV}$ denotes the TV of the image $X$ given by the
sum of magnitudes of the image gradients. Given the scene
$X$ as a 2D image, i.e., $X(u,v)$, we can define $G_u = D_u X$
and $G_v = D_v X$ as the spatial gradients of the image along
the horizontal and vertical directions, respectively. The total
variation of the image is then defined as

$\|X\|_{TV} = \sum_{u,v} \sqrt{G_u(u,v)^2 + G_v(u,v)^2}$.
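The TV definition can be made concrete with a few lines of code. The text does not specify how the gradient operators are discretized, so the sketch below assumes forward differences with a zero gradient at the last row and column; the function name `total_variation` is hypothetical.

```python
import numpy as np

def total_variation(X):
    """Isotropic TV: sum over pixels of sqrt(Gu^2 + Gv^2), forward differences."""
    Gu = np.zeros_like(X, dtype=float)
    Gv = np.zeros_like(X, dtype=float)
    Gu[:, :-1] = np.diff(X, axis=1)      # horizontal gradient, zero in the last column
    Gv[:-1, :] = np.diff(X, axis=0)      # vertical gradient, zero in the last row
    return float(np.sqrt(Gu**2 + Gv**2).sum())

# A 4 x 4 image with a single vertical edge: one unit jump in each of 4 rows.
step = np.zeros((4, 4))
step[:, 2:] = 1.0
tv_step = total_variation(step)          # 4.0
tv_flat = total_variation(np.ones((4, 4)))   # 0.0 for a constant image
```

A piecewise-constant image has a small TV because only the pixels along its edges contribute, which is the sparse-gradient property that (8) promotes.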

Minimizing the TV as in (8) produces images with sparse
gradients. The optimization problem (8) is convex and can be
efficiently solved using a variety of methods. Many extensions
and performance analyses are possible following the recently
developed theory of compressive sensing.

In addition to simplifying the calibration task, separability
of the coded mask also significantly reduces the computational
burden of image reconstruction. Iterative methods for solving
the optimization problems described above require the
repeated application of the multiplexing matrix and its transpose.
Continuing our numerical example from above, for a
non-separable, dense mask, both of these operations would require
on the order of $10^{12}$ multiplications and additions for
megapixel images. With a separable mask, however, the application
of the forward and transpose operators requires only on the
order of $2 \times 10^9$ scalar multiplications and additions, a
tremendous reduction in computational complexity.
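The operation counts quoted above follow from comparing one dense matrix-vector product with two small matrix products. The sketch below checks, on a tiny example, that the separable operator equals the dense Kronecker-structured operator applied to the column-stacked image, and then evaluates both counts for a megapixel case, assuming M = N = 1000. The function names `mults_dense` and `mults_separable` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 5
Phi_L, Phi_R = rng.standard_normal((M, N)), rng.standard_normal((M, N))
X = rng.standard_normal((N, N))

# Separable forward operator: two small matrix products.
Y_sep = Phi_L @ X @ Phi_R.T

# Equivalent dense operator acting on the column-stacked image vec(X).
Phi_dense = np.kron(Phi_R, Phi_L)                          # (M*M) x (N*N)
Y_dense = (Phi_dense @ X.flatten(order="F")).reshape((M, M), order="F")

def mults_dense(M, N):
    """Scalar multiplications for one (M^2 x N^2) matrix-vector product."""
    return M**2 * N**2

def mults_separable(M, N):
    """Phi_L @ X costs M*N*N; multiplying the result by Phi_R^T costs M*N*M."""
    return M * N * N + M * M * N

dense_ops = mults_dense(1000, 1000)          # 10**12
separable_ops = mults_separable(1000, 1000)  # 2 * 10**9
```

The same factor applies to memory: the dense matrix would need $M^2 N^2$ entries, whereas the separable model stores only $2MN$.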

V. Experimental results 

We present results on two prototypes. The first uses a
silicon-based sensor to sense in visible wavelengths and the
second uses an InGaAs sensor for sensing in short-wave
infrared.


A. Visible wavelength FlatCam prototype 

We built this FlatCam prototype as follows. 

Image sensor: We used a Sony ICX285 CCD color sensor
that came inside a Point Grey Grasshopper 3 camera (model
GS3-U3-14S5C-C). The sensor has 1036 × 1384 pixels, each
6.45µm wide, arranged in an RGB Bayer pattern. The physical
size of the sensor array is approximately 6.7mm × 8.9mm.

Mask material: We used a custom-made chrome-on-quartz
photomask that consists of a fused quartz plate, one side of
which is covered with a pattern defined using a thin chrome
film. The transparent regions of the mask transmit light, while
the chrome film regions of the mask block light.

Mask pattern and resolution: We created the binary mask
pattern as follows. We first generated a length-255 M-sequence
consisting of ±1 entries. The actual 255-length M-sequence is
shown in Fig. 8(a). We repeated the M-sequence twice to create
a 510-length sequence and computed the outer product with
itself to create a 510 × 510 matrix. Since the resulting outer
product consists of ±1 entries, we replaced every −1 with a 0
to create a binary matrix that is optically feasible. An image
showing the final 510 × 510 mask pattern is shown in Fig. 8(c).
We printed a mask from the 510 × 510 binary matrix such
that each element corresponds to a Δ = 30µm square box
(transparent, if 1; opaque, if 0) on the printed mask. Images
of the pattern that we used for the mask and the printed mask
are presented in Fig. 8. The final printed mask is a square
approximately 15.3mm on a side and covers the entire sensor
area. Even though the binary mask is not separable as is, we
can represent the sensor image using the separable system
described in (2) by subtracting the row and column means from
the sensor images (see Sec. III-D for details on calibration).

Mask placement: We opened the camera body to expose the
sensor surface and placed the quartz mask on top of it using
mechanical posts such that the mask touches the protective
glass (hot mirror) on top of the sensor. Thus the distance
between the mask and the sensor $d$ is determined by the thickness
of the glass, which for this sensor is 0.5mm.

Fig. 8: Masks used in both our visible and SWIR FlatCam
prototypes. M-sequences with ±1 entries that we used to create the
binary masks for (a) the visible camera and (b) the SWIR camera.
Binary masks created from the M-sequences for (c) the visible camera
and (d) the SWIR camera.

Data readout and processing: We adjusted the white balance 
of the sensor using Point Grey FlyCapture software and 
recorded images in 8-bit RGB format using suitable exposure 
and frame rate settings. In most of our experiments, the 
exposure time was fixed at 10ms, but we adjusted it according 
to the scene intensity to avoid excessively bright or dark 
sensor images. For the static scenes we averaged 20 sensor 
images to create a single set of measurements to be used for 
reconstruction. 
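The mask construction described above (M-sequence, repetition, outer product, thresholding) can be sketched as follows. The text does not state which generator polynomial was used for the prototype, so the sketch assumes the primitive polynomial $x^8 + x^4 + x^3 + x^2 + 1$; the function name `m_sequence_255` and the variable names are hypothetical.

```python
import numpy as np

def m_sequence_255():
    """Length-255 M-sequence in +/-1 form from an 8-stage LFSR.

    Recurrence a[n+8] = a[n+4] ^ a[n+3] ^ a[n+2] ^ a[n], i.e. the primitive
    polynomial x^8 + x^4 + x^3 + x^2 + 1 (an illustrative choice).
    """
    a = [1, 0, 0, 0, 0, 0, 0, 0]                 # any nonzero seed works
    while len(a) < 255:
        a.append(a[-4] ^ a[-5] ^ a[-6] ^ a[-8])
    return 2 * np.array(a) - 1                   # map {0, 1} to {-1, +1}

seq = m_sequence_255()
psi = np.tile(seq, 2)                            # repeat twice: length 510
mask_pm = np.outer(psi, psi)                     # 510 x 510 matrix of +/-1 entries
mask_01 = (mask_pm > 0).astype(np.uint8)         # replace every -1 with 0

feature_um = 30                                  # feature size from the text
side_mm = mask_01.shape[0] * feature_um / 1000.0 # printed mask side in mm
open_fraction = mask_01.mean()                   # fraction of transparent features
```

With these choices the mask side evaluates to 15.3mm, matching the printed mask, and roughly half of the features are transparent.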

We reconstructed 512 × 512 RGB images from our prototype
using 512 × 512 RGB sensor measurements. Since the sensor
has 1036 × 1384 pixels, we first cropped and uniformly
subsampled the sensor image to create an effective 512 × 512
color sensor image; then we subtracted the row and column
means from that image. The resulting image corresponds
to the measurements described by (2), which we used to
reconstruct the desired image $X$. Some example sensor images
and corresponding reconstruction results are shown in Fig. 7.
In these experiments, we solved the $\ell_2$-regularized least-squares
problem in (6), followed by BM3D denoising [39]. Solving the
least-squares recovery problem for a single 512 × 512 image
using the pre-computed SVD requires a fraction of a second on a
standard laptop computer.
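The preprocessing step can be sketched as below. The exact crop region and subsampling pattern are not given in the text, so the sketch assumes a centered 1024 × 1024 crop followed by keeping every second pixel; the function name `preprocess` is hypothetical, and a random array stands in for a recorded sensor frame.

```python
import numpy as np

def preprocess(rgb, out=512, step=2):
    """Center-crop, subsample, and zero the row/column means of each channel."""
    h, w, _ = rgb.shape
    side = out * step                                    # 1024-pixel square crop
    r0, c0 = (h - side) // 2, (w - side) // 2
    img = rgb[r0:r0 + side:step, c0:c0 + side:step, :].astype(float)
    img -= img.mean(axis=1, keepdims=True)               # subtract the mean of every row
    img -= img.mean(axis=0, keepdims=True)               # subtract the mean of every column
    return img

# Synthetic 8-bit RGB frame with the sensor's dimensions (assumption).
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(1036, 1384, 3), dtype=np.uint8)
Y = preprocess(frame)                                    # shape (512, 512, 3)
```

Each color channel of `Y` can then be passed independently to the regularized solver, as is done for the color results in the text.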

We present a comparison of three different methods for
reconstructing static scenes in Fig. 9. We used MATLAB
for solving all the computational problems. For the results
presented in Fig. 9, we recorded sensor measurements while
displaying test images on an LCD monitor placed 28cm away
from the camera and by placing various objects in front of the
camera in ambient lighting.

We used three methods for reconstructing the scenes from
the sensor measurements:

1) We computed and stored the SVD of $\Phi_L$ and $\Phi_R$ and solved
the $\ell_2$-regularized problem in (6) as described in (7).
The average computation time for the reconstruction of a
single 512 × 512 image on a standard laptop was 75ms.
The results of SVD-based reconstruction are presented in
Fig. 9(a). The reconstructed images are slightly noisy,
with details missing around the edges.

2) To reduce the noise in our SVD-estimated images, we
applied BM3D denoising to each reconstructed image.
The results of SVD/BM3D reconstruction are presented
in Fig. 9(b). The average computation time for BM3D
denoising of a single image was 10s.

3) To improve our results further, we reconstructed by
solving the TV minimization problem (8). The results of
TV reconstruction are presented in Fig. 9(c). While, as
expected, the TV method recovers more details around
the edges, the overall reconstruction quality is not
appreciably different from SVD-based reconstruction.
The computation time of TV, however, increases to 75s
per image.

Fig. 7: Visible FlatCam prototype and results. (a) Prototype consists of a Sony ICX285 sensor with a separable M-sequence mask placed
approximately 0.5mm from the sensor surface. (b) The sensor measurements are different linear combinations of the light from different
points in the scene. (c) Reconstructed 512 × 512 color images by processing each color channel independently.

To demonstrate the flexibility of our FlatCam design, we
also captured and reconstructed dynamic scenes at typical
video rates. We present selected frames¹ from two videos in
Fig. 10. The images presented in Fig. 10(a) are reconstructed
frames from a video of a hand making counting gestures,
recorded at 30 frames per second. The images presented in
Fig. 10(b) are reconstructed frames from a video of a toy bird
dipping its head in water, recorded at 10 frames per second.
In both cases, we reconstructed each video frame at 512 × 512
pixel resolution by solving (6) using the SVD-based method
described in (7), followed by BM3D denoising.


¹Complete videos are available at http://bit.ly/FlatCam.


B. SWIR FlatCam prototype 

This FlatCam prototype consists of a Goodrich 320KTS-1.7RT
InGaAs sensor with a binary separable M-sequence
mask placed at distance d = 5mm. The sensor-mask distance
is large in this prototype because of the protective casing on
top of the sensor. We used a feature size of Δ = 100µm for
the mask, which was constructed using the same photomask
process as for the visible camera. The sensor has 256 × 300
pixels, each of size w = 25µm, but because of the large
sensor-to-mask distance and mask feature size, the effective
system resolution is limited. Therefore, we binned 4 × 4 pixels
on the sensor (and cropped a square region of the sensor)
to produce sensor measurements of effective size 64 × 64.
We reconstructed images with the same 64 × 64 resolution;
example results are shown in Fig. 11.
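The cropping and 4 × 4 binning step can be sketched with a reshape. The text does not say which square region was kept or whether binned pixels were summed or averaged, so the sketch assumes a centered 256 × 256 crop and summation; the function name `crop_and_bin` is hypothetical.

```python
import numpy as np

def crop_and_bin(frame, side=256, b=4):
    """Center-crop a square region and sum non-overlapping b x b pixel blocks."""
    h, w = frame.shape
    r0, c0 = (h - side) // 2, (w - side) // 2
    sq = frame[r0:r0 + side, c0:c0 + side].astype(float)
    # Split each axis into (blocks, b) and sum over the two within-block axes.
    return sq.reshape(side // b, b, side // b, b).sum(axis=(1, 3))

# Synthetic frame with the sensor dimensions quoted in the text.
frame = np.ones((256, 300))
binned = crop_and_bin(frame)        # shape (64, 64); every entry sums 16 pixels
```

Binning trades spatial sampling for signal level, which fits a system whose effective resolution is already limited by the mask distance and feature size.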

VI. Discussions and Conclusions 

The mask-based, lens-free FlatCam design proposed here
can have a significant impact in an important emerging area of
imaging, since high-performance, broad-spectrum cameras can
be monolithically fabricated instead of requiring cumbersome
post-fabrication assembly. The thin form factor and low cost
of lens-free cameras make them ideally suited for many
applications in surveillance, large surface cameras, flexible
or foldable cameras, disaster recovery, and beyond, where
cameras are either disposable resources or integrated in flat
or flexible surfaces and therefore have to satisfy strict
thickness constraints. Emerging applications like wearable devices,
internet-of-things, and in-vivo imaging could also benefit from
the FlatCam approach.

Fig. 9: Images reconstructed at 512 × 512 resolution using the visible FlatCam prototype and three different reconstruction methods. (a)
SVD-based solution of (6); average computation time per image = 75ms. (b) SVD/BM3D reconstruction; average computation time per
image = 10s. (c) Total variation (TV) based reconstruction; average computation time per image = 75s.



Fig. 11: Short wave infrared (SWIR) FlatCam prototype and results.
(a) Prototype consists of a Goodrich 320KTS-1.7RT sensor with a
separable M-sequence mask placed approximately 5mm from the
detector surface. (b) Reconstructed 64 × 64 images.

A. Advantages of FlatCam 

We make key changes in our FlatCam design to move
away from the cube-like form-factor of traditional lens-based
and coded aperture cameras while retaining their high light
collection abilities. We move the coded mask extremely close
to the image sensor, which results in a thin, flat camera. We use
a binary mask pattern with 50% transparent features, which,
when combined with the large surface area sensor, enables
large light collection capabilities. We use a separable mask
pattern, similar to the prior work in coded aperture imaging
[10], which enables simpler calibration and reconstruction.
The result is a radically different form factor from previous
camera designs that can enable integration of FlatCams into
large surfaces and flexible materials such as wallpaper and
clothes that require thin, flat, and lightweight materials [40].

Flat form factor. The flatness of a camera system can be
measured by its thickness-to-width ratio (TWR). The form
factor of most cameras, including pinhole and lens-based
cameras, conventional coded-aperture systems, and miniature
diffraction grating-based systems, is cube-like; that is, the
thickness of the device is of the same order of magnitude as
the sensor width, resulting in TWR ≈ 1. Cube-like camera
systems suffer from a significant limitation: if we reduce
the thickness of the camera by an order of magnitude while
preserving its TWR, then the area of the sensor drops by two
orders of magnitude. This results in a two orders of magnitude
reduction in light collection ability. In contrast, FlatCams are
endowed with flat form factors; by design, the thickness of
the device is an order of magnitude smaller than the sensor
width. Thus, for a given thickness constraint, a FlatCam can
utilize a large sensing surface for light collection. In our visible
FlatCam prototype, for example, the sensor-to-mask distance
is 0.5mm, while the sensor width is about 6.7mm, resulting
in TWR ≈ 0.075. While on-chip lensless microscopes can
also achieve such low TWRs, such systems require complete
control of the illumination and the subject to be less than
1 mm from the camera [30]. We are unaware of any other
far-field imaging system that has a TWR comparable to that of the
FlatCam while providing reasonable light capture and imaging
resolution.

Fig. 10: Dynamic scenes captured by a FlatCam at video rates and reconstructed at 512 × 512 resolution. (a) Frames from the video of
hand gestures captured at 30 frames per second. (b) Frames from the video of a toy bird captured at 10 frames per second.

High light collection. The light collection ability of an
imaging system depends on two factors: its sensor area and the
square of its numerical aperture. Conventional sensor pixels
typically have an angular response of 40-60 degrees, which
is referred to as the sensor's chief ray angle (CRA). The total
amount of light that can be sensed by a sensor is often limited
by the CRA, which in turn determines the maximum allowable
numerical aperture of the system. Specifically, whether we
consider the best lens-based camera, or even a fully exposed
sensor, the cone of light that can enter a pixel is determined
by the CRA.

Consider an imaging system with a strict constraint on
the device thickness $T_{max}$. The light collection $L$ of such
an imaging device can be described as $L \propto W^2 N_A^2$, where
$W$ denotes the width of the (square) sensor and $N_A$ denotes
the numerical aperture. Since $W_{max} = T_{max}/\mathrm{TWR}$, we have
$L \propto W^2 N_A^2 \le (N_A T_{max}/\mathrm{TWR})^2$. Thus, given a thickness
constraint $T_{max}$, the light collection of an imaging system is
directly proportional to the square of the numerical aperture
and inversely proportional to the square of its TWR. Hence, a
smaller TWR leads to better light collection.

The numerical aperture of our prototype FlatCams is limited
by the CRA of the sensors. Moreover, half of the features in
our mask are opaque and block one half of the light that would
have otherwise entered the sensor. Realizing that the numerical
aperture of such a FlatCam is reduced only by a factor of $\sqrt{2}$
compared to an open aperture, yet its TWR is reduced by an
order of magnitude, leads to the conclusion that a FlatCam
collects approximately two orders of magnitude more light
than a cube-like miniature camera of the same thickness.
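The scaling argument can be checked with a few lines of arithmetic. The sketch below evaluates the bound $(N_A T_{max}/\mathrm{TWR})^2$ for a cube-like camera and for a FlatCam of the same thickness. The reference numerical aperture and thickness are arbitrary normalizations, the function names `twr` and `relative_light` are hypothetical, and the TWR values of 1 and 0.1 are the round figures used in the text.

```python
import math

def twr(thickness_mm, width_mm):
    """Thickness-to-width ratio of a camera."""
    return thickness_mm / width_mm

def relative_light(na, t_max, twr_value):
    """Upper bound on light collection, up to a constant: (N_A * T_max / TWR)^2."""
    return (na * t_max / twr_value) ** 2

flatcam_twr = twr(0.5, 6.7)                     # visible prototype, about 0.075

na_open = 1.0                                   # arbitrary reference aperture (assumption)
t_max = 1.0                                     # same thickness budget for both cameras
cube = relative_light(na_open, t_max, 1.0)                    # cube-like, TWR = 1
flat = relative_light(na_open / math.sqrt(2), t_max, 0.1)     # half-open mask, TWR = 0.1
gain = flat / cube                              # 50x under these round numbers

# Same comparison with the TWR of the visible prototype.
gain_prototype = relative_light(na_open / math.sqrt(2), t_max, flatcam_twr) / cube
```

Under these assumptions the gain is a factor of 50 for TWR = 0.1 and somewhat larger for the prototype's TWR, which is the sense in which the text speaks of approximately two orders of magnitude.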

B. Limitations of FlatCam 

FlatCam is a radical departure from centuries of research
and development in lens-based cameras, and as such it
has its own limitations.

Achievable image/angular resolution. Our current
prototypes have low spatial resolution, which is attributed to two
factors. First, it is well known that the angular resolution of
pinhole cameras and coded aperture cameras decreases when
the mask is moved closer to the sensor. This results
in an implicit tradeoff between the achievable thickness and
the achievable resolution. Second, the image recorded on the
image sensor in a FlatCam is a linear combination of the
scene radiance, where the multiplexing matrix is controlled by
the mask pattern and the distance between mask and sensor. This
means that recovering the scene from sensor measurements
requires demultiplexing. Noise amplification is an unfortunate
outcome of any linear demultiplexing based system. While
the magnitude of this noise amplification can be controlled by
careful design of the mask patterns, it cannot be completely
eliminated in FlatCam. In addition, the singular values of the
linear system are such that the noise amplification for higher
spatial frequencies is larger, which consequently limits the
spatial resolution of the recovered image. We are currently
working on several techniques to improve the spatial resolution
of the recovered images.

Direct-view and real-time operation. In traditional lens-based
cameras, the image sensed by the image sensor is
the photograph of the scene. In FlatCam, a computational
algorithm is required to convert the sensor measurements
into a photograph of the scene. This results in a time lag
between the sensor acquisition and the image display, a time
lag that depends on processing time. Currently, our SVD-based
reconstruction operates at near real-time (about 10 fps),
resulting in about a 100 ms delay between capture and display.
While this may be acceptable for certain applications, there
are many other applications such as augmented reality and
virtual reality, where such delays are unacceptable. Order-of-magnitude
improvements in processing times are required
before FlatCam becomes amenable to such applications.